2022-05-10

Data set overview

  • Dimensions of the raw data set: 392 rows, 13 columns

  • Stratified on Controls and PCa cases (attribute called Group)

  • Purpose of article: Predict PCa from other variables, mainly mtDNA

Cleaning and augmenting the data set

Cleaning

  • Check for duplicates

  • Filter for pcr_success

  • New dimensions: 387 rows, 13 columns
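
The two cleaning steps can be sketched in plain Python as follows. This is only an illustration: the rows are made up, and only the column name pcr_success comes from the slides. The slides do not state whether the five removed rows were duplicates, failed PCR runs, or a mix of both.

```python
# Hypothetical rows; only the column name pcr_success comes from the slides.
rows = [
    {"id": 1, "group": 0, "pcr_success": True},
    {"id": 2, "group": 1, "pcr_success": True},
    {"id": 2, "group": 1, "pcr_success": True},   # exact duplicate of the row above
    {"id": 3, "group": 1, "pcr_success": False},  # failed PCR, to be filtered out
]


def drop_duplicates(records):
    """Keep the first occurrence of each fully identical record."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def clean(records):
    """Remove duplicates, then keep only rows where PCR succeeded."""
    return [rec for rec in drop_duplicates(records) if rec["pcr_success"]]


cleaned = clean(rows)
print(len(rows), "->", len(cleaned))
```

Printing the row count before and after each step makes it easy to confirm that the reported dimensions are reproduced.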

Augmenting

  • bmi- and dfi-classifier

  • New columns based on TNM-notation

  • Add “group” as strings

  • New dimensions: 387 rows, 19 columns
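
A minimal sketch of how such derived columns might be built. Everything here is assumed for illustration: the column names (bmi, group, tnm), the coding 0 = Control and 1 = PCa, the TNM string format, and the use of the standard WHO BMI cut-offs. The dfi-classifier is left out because the slides do not define its classes.

```python
import re

# Assumed format of a TNM string, e.g. "T2aN0M0"; the real data may differ.
TNM_PATTERN = re.compile(r"^T([0-4][a-c]?|X|is)N([0-3]|X)M([01]|X)$")

# Assumed coding of the Group attribute.
GROUP_LABELS = {0: "Control", 1: "PCa"}


def bmi_class(bmi):
    """Map a BMI value (kg/m^2) to a WHO weight category."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


def split_tnm(tnm):
    """Split a TNM string into its T, N and M parts; None if it does not parse."""
    match = TNM_PATTERN.match(tnm)
    if match is None:
        return {"t": None, "n": None, "m": None}
    t, n, m = match.groups()
    return {"t": t, "n": n, "m": m}


def augment(record):
    """Return a copy of the record with the derived columns added."""
    out = dict(record)
    out["bmi_class"] = bmi_class(record["bmi"])
    out["group_label"] = GROUP_LABELS[record["group"]]
    out.update(split_tnm(record["tnm"]))
    return out


example = augment({"bmi": 27.4, "group": 1, "tnm": "T2aN0M0"})
print(example)
```

Returning a copy keeps the raw record untouched, so the cleaned and the augmented data can be compared side by side.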

Boxplot with continuous variables, any outliers?
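
For reference, a boxplot conventionally draws as outliers the points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to made-up values; plotting libraries may compute quartiles slightly differently, so borderline points can differ from what a given plot shows.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one clearly extreme value.
measurements = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 14.0]
outliers = iqr_outliers(measurements)
print(outliers)
```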

Boxplot with discrete variables, any outliers?

Re-creating plot from the article

Article visualization

A better biomarker for PCa?

Interesting finding during exploratory data analysis

Logistic regression, excl. PSA

Significant p-values:
Maybe the distribution of the Dfi classes is skewed?
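
As a reminder of where such p-values come from, here is a minimal sketch of a one-predictor logistic regression fitted by Newton-Raphson, with two-sided Wald p-values. The data are invented and the function names are hypothetical; the actual analysis presumably used a statistics package with several predictors.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit y ~ intercept + slope * x by Newton-Raphson.

    Returns the coefficients (b0, b1) and their standard errors.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Predicted probabilities under the current coefficients.
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Fisher information matrix [[h00, h01], [h01, h11]].
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add inverse(information) times gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard errors from the diagonal of the inverse information matrix,
    # taken from the last iteration (the same values once the fit has converged).
    se0 = math.sqrt(h11 / det)
    se1 = math.sqrt(h00 / det)
    return (b0, b1), (se0, se1)


def wald_p_value(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0, using a normal approximation."""
    z = estimate / std_error
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical predictor values and case/control labels (1 = case).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

(b0, b1), (se0, se1) = fit_logistic(x, y)
p_slope = wald_p_value(b1, se1)
print(f"slope={b1:.3f}, se={se1:.3f}, p={p_slope:.3f}")
```

The fit only converges when the classes overlap in the predictor. With perfectly separated data the maximum-likelihood estimate does not exist and the coefficients grow without bound.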

Logistic regression, incl. PSA

Significant p-values:

Principal component analysis (PCA)

PCA

Conclusion

  • We can support the conclusion of the article: mtDNA is a biomarker for PCa (i.e., the result is reproducible)
  • PSA levels seem to be an even better biomarker
  • Both of the above could be supported by logistic regression
  • Conclusion for PCA?